This report examines a sleep study dataset: it inspects the column values, plots the relationships between several variables, and compares the results of several sampling methods.

Review of dataset columns

glimpse(as_tibble(data))
## Rows: 374
## Columns: 14
## $ Person.ID               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ Gender                  <chr> "Male", "Male", "Male", "Male", "Male", "Male"…
## $ Age                     <int> 27, 28, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29…
## $ Occupation              <chr> "Software Engineer", "Doctor", "Doctor", "Sale…
## $ Sleep.Duration          <dbl> 6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8, 7…
## $ Quality.of.Sleep        <int> 6, 6, 6, 4, 4, 4, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6…
## $ Physical.Activity.Level <int> 42, 60, 60, 30, 30, 30, 40, 75, 75, 75, 30, 75…
## $ Stress.Level            <int> 6, 8, 8, 8, 8, 8, 7, 6, 6, 6, 8, 6, 8, 8, 8, 8…
## $ BMI.Category            <chr> "Overweight", "Normal", "Normal", "Obese", "Ob…
## $ Blood.Pressure          <chr> "126/83", "125/80", "125/80", "140/90", "140/9…
## $ Heart.Rate              <int> 77, 75, 75, 85, 85, 85, 82, 70, 70, 70, 70, 70…
## $ Daily.Steps             <int> 4200, 10000, 10000, 3000, 3000, 3000, 3500, 80…
## $ Sleep.Disorder          <chr> "None", "None", "None", "Sleep Apnea", "Sleep …
## $ Disorder.Exists         <chr> "No", "No", "No", "Yes", "Yes", "Yes", "Yes", …

Frequency of gender in data

Males and females are nearly equally represented in the dataset (189 and 185 of 374 people).

genders = as.data.frame(table(data$Gender)); genders
##     Var1 Freq
## 1 Female  185
## 2   Male  189
gender_chart = plot_ly(genders, labels = ~Var1, values = ~Freq, type = "pie")
gender_chart %>% layout(title = "Proportion of Males and Females in Dataset")

Frequency of occupation in data

There are 11 occupations listed. Nurses, doctors and engineers together make up just over half of the people (207 of 374). The least represented occupations are manager (1), sales representative (2), and scientist and software engineer (4 each).

occupation = as.data.frame(table(data$Occupation)); occupation
##                    Var1 Freq
## 1            Accountant   37
## 2                Doctor   71
## 3              Engineer   63
## 4                Lawyer   47
## 5               Manager    1
## 6                 Nurse   73
## 7  Sales Representative    2
## 8           Salesperson   32
## 9             Scientist    4
## 10    Software Engineer    4
## 11              Teacher   40
occ_chart = plot_ly(occupation, x = ~Var1, y = ~Freq, type = "bar")
occ_chart %>% layout(title = "Frequencies of Occupation in Dataset",
                     xaxis= list(title = "Occupation"))
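The share held by the three largest occupations can be checked directly from the printed counts. This is a minimal sketch that retypes the frequency table above as a named vector (the vector name occupation_counts is introduced here for illustration):

```r
# Counts copied from the occupation frequency table above
occupation_counts <- c(Accountant = 37, Doctor = 71, Engineer = 63,
                       Lawyer = 47, Manager = 1, Nurse = 73,
                       "Sales Representative" = 2, Salesperson = 32,
                       Scientist = 4, "Software Engineer" = 4, Teacher = 40)

# Three most common occupations and their combined share of all 374 people
top3 <- sort(occupation_counts, decreasing = TRUE)[1:3]
top3_share <- round(100 * sum(top3) / sum(occupation_counts), 1)
top3
top3_share
```

Nurses, doctors and engineers account for 207 of the 374 people, which is a little over 55%.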

Distribution of ages in data

Ages range from 27 to 59 with a median of 43, and there are no outliers. Excluding people at the median age, younger people (186) outnumber older people (154).

f = fivenum(data$Age)
names(f) = c("Min", "Q1", "Median","Q3", "Max"); f
##    Min     Q1 Median     Q3    Max 
##     27     35     43     50     59
paste("The number of people younger than the median age:", sum(data$Age < f[3]))
## [1] "The number of people younger than the median age: 186"
paste("The number of people older than the median age:", sum(data$Age > f[3]))
## [1] "The number of people older than the median age: 154"
age_chart = plot_ly(x = data$Age, type = "box")
age_chart %>% layout(title = "Distribution of Ages",
                       xaxis = list(title = "Age"))
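The statement that there are no outliers can be checked against the common 1.5 x IQR rule. The sketch below assumes the hinges returned by fivenum() stand in for the quartiles; plotly may compute its box quartiles slightly differently.

```r
# Five-number summary of Age, copied from the fivenum() output above
f <- c(Min = 27, Q1 = 35, Median = 43, Q3 = 50, Max = 59)

# Interquartile range and the conventional 1.5 * IQR fences
iqr <- f[["Q3"]] - f[["Q1"]]
lower_fence <- f[["Q1"]] - 1.5 * iqr
upper_fence <- f[["Q3"]] + 1.5 * iqr

# There are no outliers if both extremes sit inside the fences
no_outliers <- f[["Min"]] >= lower_fence && f[["Max"]] <= upper_fence
c(lower = lower_fence, upper = upper_fence)
no_outliers
```

With an IQR of 15 the fences fall at 12.5 and 72.5, so the observed range of 27 to 59 lies well inside them.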

Comparison of BMI Category and Sleep Disorder

Most people (219 of 374) do not have a sleep disorder, and similar numbers have Insomnia (77) and Sleep Apnea (78). There appears to be an association between sleep disorders and weight. Among the people with no sleep disorder, the large majority have a normal BMI and about 9% are overweight. In contrast, most people with a sleep disorder are overweight, and a few are obese.

sleep_vs_BMI = table(data$Sleep.Disorder, data$BMI.Category); sleep_vs_BMI
##              
##               Normal Obese Overweight
##   Insomnia         9     4         64
##   None           200     0         19
##   Sleep Apnea      7     6         65
df = as.data.frame(sleep_vs_BMI)
chart = plot_ly(df, x = ~Var1, y = ~Freq, color = ~Var2, type = "bar")
chart %>% layout(title = "Comparison of People with Sleep Disorders Grouped by BMI Category",
                 yaxis = list(title = "Number of People"), 
                 xaxis = list(title = "Sleep Disorder"),
                 barmode = "stack")
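The within-group shares are easier to compare as row percentages than as raw counts. A minimal sketch, retyping the counts from the table above into a matrix and using prop.table() from base R:

```r
# Counts copied from the sleep_vs_BMI table above
sleep_vs_BMI <- matrix(
  c(  9, 4, 64,
    200, 0, 19,
      7, 6, 65),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Disorder = c("Insomnia", "None", "Sleep Apnea"),
    BMI      = c("Normal", "Obese", "Overweight")
  )
)

# Share of each BMI category within each sleep-disorder group (rows sum to 100)
row_pct <- round(100 * prop.table(sleep_vs_BMI, margin = 1), 1)
row_pct
```

Read row by row, about 91% of people with no disorder have a normal BMI and about 9% are overweight, while roughly 83% of the people in each disorder group are overweight.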

This can be seen more clearly when sleep disorders are grouped as existing (Yes) or not existing (No).

disorder_vs_BMI = table(data$Disorder.Exists, data$BMI.Category)
df2 = as.data.frame(disorder_vs_BMI)
chart = plot_ly(df2, x = ~Var1, y = ~Freq, color = ~Var2, type = "bar")
chart %>% layout(title = "Comparison of People with Sleep Disorders Grouped by BMI Category",
                 yaxis = list(title = "Number of People"), 
                 xaxis = list(title = "Sleep Disorder Exists"),
                 barmode = "stack")

Grouping the data by whether a sleep disorder exists and by BMI category, then averaging hours of sleep, suggests that BMI is associated with sleep duration. People with a normal BMI have the highest average sleep duration, even among those who also have a sleep disorder. People who are overweight or obese sleep less on average, with the lowest average for people who have a sleep disorder and are overweight.

bmi_and_dis_avgsleep = data |>
  group_by(Disorder.Exists, BMI.Category) |>
  summarise(avgsleep = mean(Sleep.Duration)); bmi_and_dis_avgsleep
## # A tibble: 5 × 3
## # Groups:   Disorder.Exists [2]
##   Disorder.Exists BMI.Category avgsleep
##   <chr>           <chr>           <dbl>
## 1 No              Normal           7.41
## 2 No              Overweight       6.8 
## 3 Yes             Normal           7.09
## 4 Yes             Obese            6.96
## 5 Yes             Overweight       6.77

Comparison of Sleep Duration, Physical Activity and Gender

Women get slightly more sleep than men, with an average of 7.23 hours a night compared to 7.04.

data |> group_by(Gender) |> summarize(avg_sleep = mean(Sleep.Duration))
## # A tibble: 2 × 2
##   Gender avg_sleep
##   <chr>      <dbl>
## 1 Female      7.23
## 2 Male        7.04

There may be a positive correlation between physical activity level and sleep duration. Running counter to it are two groups of women: one with the lowest physical activity and the highest sleep, and one with the highest physical activity and the lowest sleep.

activity_vs_sleep = plot_ly(data = data,
                            x = ~Physical.Activity.Level,
                            y = ~Sleep.Duration,
                            color = ~Gender,
                            colors = c("#7769f3", "#54bf49"),
                            type = "scatter",
                            mode = "markers")
activity_vs_sleep %>% layout(title = "Comparison of Physical Activity Level and Sleep Duration",
                             xaxis = list(title = "Physical Activity Level"),
                             yaxis = list(title = "Sleep Duration"),
                             legend = list(title = list(text = "Gender")))
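One way to put a number on the relationship shown in the scatter plot is a rank correlation. The sketch below is illustrative only: it uses just the first nine rows that are fully visible in the glimpse() output at the top of the report, which is far too few to describe all 374 people. Choosing Spearman rather than Pearson is an assumption made here because the data contain many ties and the relationship may not be linear.

```r
# First nine rows only, copied from the glimpse() output at the top of the report
activity <- c(42, 60, 60, 30, 30, 30, 40, 75, 75)           # Physical.Activity.Level
sleep    <- c(6.1, 6.2, 6.2, 5.9, 5.9, 5.9, 6.3, 7.8, 7.8)  # Sleep.Duration

# Spearman rank correlation: based on ranks, so it does not assume linearity
rho <- cor(activity, sleep, method = "spearman")
rho
```

On the full dataset the same call would take data$Physical.Activity.Level and data$Sleep.Duration, and cor.test() could be used in place of cor() to obtain a p-value as well.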

If we replace gender with occupation, we can see that these two groups are women engineers (low activity, high sleep) and women nurses (high activity, low sleep).

activity_vs_sleep2 = plot_ly(data = data,
                             x = ~Physical.Activity.Level,
                             y = ~Sleep.Duration,
                             color = ~Occupation,
                             type = "scatter",
                             mode = "markers")
activity_vs_sleep2 %>% layout(title = "Comparison of Physical Activity Level and Sleep Duration",
                              xaxis = list(title = "Physical Activity Level"),
                              yaxis = list(title = "Sleep Duration"),
                              legend = list(title = list(text = "Occupation")))

Distribution of Physical Activity Level

It is unclear from the provided data what distribution physical activity level follows. It might follow a right-skewed distribution, such as an exponential, if the full range of minutes of activity were present. Judging by this chart, we might expect more people to have under 30 minutes of activity than over 110 minutes.

activity_dist = plot_ly(x = ~data$Physical.Activity.Level, 
                        type = "histogram", 
                        xbins = list(size = 15))
activity_dist %>% layout(title = "Distribution of Physical Activity Level",
                         xaxis = list(title = "Minutes of Activity", range = c(20,110)),
                         yaxis = list(title = "Frequency", range = c(0,90)))

The boxplot shows that physical activity level is almost perfectly symmetric between 30 and 90 minutes, with a Q1 of 45 minutes, a median of 60 minutes, and a Q3 of 75 minutes.

lvl = fivenum(data$Physical.Activity.Level)
names(lvl) = c("Min", "Q1", "Median","Q3", "Max"); lvl
##    Min     Q1 Median     Q3    Max 
##     30     45     60     75     90
activity_box_chart = plot_ly(x = data$Physical.Activity.Level, type = "box")
activity_box_chart %>% layout(title = "Distribution of Physical Activity Level",
                          xaxis = list(title = "Minutes of Activity"))

Applicability of Central Limit Theorem on Sleep Duration

Sleep duration ranges from 5.8 to 8.5 hours, with a mean of 7.13 hours and a standard deviation of 0.796 hours.

table(data$Sleep.Duration)
## 
## 5.8 5.9   6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 
##   2   4  31  25  12  13   9  26  20   5   5   3  19  36  14   5   5  10  24  28 
## 7.9   8 8.1 8.2 8.3 8.4 8.5 
##   7  13  15  11   5  14  13
paste("The mean of the population is", round(mean(data$Sleep.Duration),3))
## [1] "The mean of the population is 7.132"
paste("The standard deviation of the population is", round(sd(data$Sleep.Duration),3))
## [1] "The standard deviation of the population is 0.796"

Sleep duration does not have a clear distribution in this data. However, based on what we know about sleep, it could be assumed to follow a normal distribution, with few people sleeping less than 5.5 hours or more than 8.5 hours.

sleep_dur = plot_ly(x = ~data$Sleep.Duration,
                    type = "histogram",
                    histnorm = "probability",
                    xbins = list(size = .3))
sleep_dur %>% layout(title = "Distribution of Sleep Duration",
                     xaxis = list(title = "Hours of Sleep", range = c(5.5,9)),
                     yaxis = list(title = "Proportion", range = c(0,.3)))

Drawing 1000 random samples from the data (without replacement) at each sample size of 25, 50 and 75 people and plotting the sample means shows distributions that are centered on the population mean and become narrower and more peaked as the sample size grows, which is consistent with the Central Limit Theorem.

set.seed(5919)
num_samples = 1000
size1 = 25
size2 = 50
size3 = 75

# Means of 1000 samples of size 25
xmean1 = numeric(num_samples)
for (i in 1:num_samples) {
  xmean1[i] <- mean(sample(data$Sleep.Duration, size1, replace = FALSE))
}

plot1 = plot_ly(x = xmean1, type = "histogram", histnorm = "probability")
plot1 %>% layout(title = "Means of 1000 Samples of Size 25",
                 xaxis = list(title = "Mean Hours of Sleep", range = c(5.5, 9)),
                 yaxis = list(title = "Proportion", range = c(0,.15)))

# Means of 1000 samples of size 50
xmean2 = numeric(num_samples)
for (i in 1:num_samples) {
  xmean2[i] <- mean(sample(data$Sleep.Duration, size2, replace = FALSE))
}

plot2 = plot_ly(x = xmean2, type = "histogram", histnorm = "probability")
plot2 %>% layout(title = "Means of 1000 Samples of Size 50",
                 xaxis = list(title = "Mean Hours of Sleep", range = c(5.5, 9)),
                 yaxis = list(title = "Proportion", range = c(0,.15)))

# Means of 1000 samples of size 75
xmean3 = numeric(num_samples)
for (i in 1:num_samples) {
  xmean3[i] <- mean(sample(data$Sleep.Duration, size3, replace = FALSE))
}

plot3 = plot_ly(x = xmean3, type = "histogram", histnorm = "probability")
plot3 %>% layout(title = "Means of 1000 Samples of Size 75",
                 xaxis = list(title = "Mean Hours of Sleep", range = c(5.5, 9)),
                 yaxis = list(title = "Proportion", range = c(0,.15)))
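The narrowing seen in the three histograms can also be compared with what theory predicts. The sketch below rebuilds the 374 sleep durations from the frequency table printed earlier in this section, then compares the spread of 1000 sample means at each size with the theoretical standard error. The helper names se_theory and se_empirical are introduced here for illustration, and the finite population correction is included on the assumption that it matters when sampling without replacement from only 374 people.

```r
# Rebuild the 374 sleep durations from the frequency table printed earlier
values <- c(5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9,
            7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1, 8.2,
            8.3, 8.4, 8.5)
counts <- c(2, 4, 31, 25, 12, 13, 9, 26, 20, 5, 5, 3,
            19, 36, 14, 5, 5, 10, 24, 28, 7, 13, 15, 11,
            5, 14, 13)
sleep <- rep(values, counts)
N <- length(sleep)

# Population SD (divides by N, whereas sd() divides by N - 1)
sigma <- sqrt(mean((sleep - mean(sleep))^2))

# Theoretical standard error of the mean when sampling without replacement:
# sigma / sqrt(n), shrunk by the finite population correction
se_theory <- function(n) sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Empirical SD of `reps` sample means for a given sample size
se_empirical <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(sleep, n, replace = FALSE))))
}

set.seed(5919)
sizes <- c(25, 50, 75)
comparison <- data.frame(
  size      = sizes,
  empirical = sapply(sizes, se_empirical),
  theory    = sapply(sizes, se_theory)
)
comparison
```

If the empirical column tracks the theory column and both shrink as the sample size grows, the simulation is behaving as the Central Limit Theorem and the standard error formula suggest. The random draws here will not match the ones plotted above, because the calls to the random number generator differ.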

Comparison of Sampling Methods and Effect on Mean Sleep Duration

The original data has 374 rows and a mean of 7.13 hours of sleep.

N = nrow(data)
paste("The mean of the population is", round(mean(data$Sleep.Duration),3))
## [1] "The mean of the population is 7.132"

Simple Random Sampling

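Drawing 50 people by simple random sampling with replacement gives the following frequencies and mean. Because the code keeps each selected row once (a != 0), people drawn more than once are counted a single time, which leaves 45 distinct people in the sample.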

set.seed(5919)
n = 50
a = srswr(n, N)
rows1 = (1:N)[a!=0]
simple_random = data[a != 0,]
table(simple_random$Sleep.Duration)
## 
## 5.9   6 6.1 6.2 6.4 6.5 6.6 6.9 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8   8 8.1 8.2 8.4 
##   1   2   6   1   3   3   2   1   3   4   1   1   1   1   1   3   2   3   3   1 
## 8.5 
##   2
paste("The mean using simple random sampling is", round(mean(simple_random$Sleep.Duration),3))
## [1] "The mean using simple random sampling is 7.129"
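The vector returned by srswr() counts how many times each row was drawn, so filtering on a != 0 keeps each selected row only once. The sketch below shows the difference between the two ways of indexing. It swaps in base R's sample() and tabulate() for srswr() so that it has no package dependency; the counts vector plays the role of a.

```r
set.seed(5919)
N <- 374  # population size
n <- 50   # number of draws

# Draw n row indices with replacement, then count how often each row appears.
# `counts` has the same shape as the vector srswr(n, N) returns.
idx    <- sample(N, n, replace = TRUE)
counts <- tabulate(idx, nbins = N)

rows_distinct <- which(counts != 0)       # each selected row once
rows_all      <- rep(seq_len(N), counts)  # repeats rows drawn more than once

length(rows_distinct)  # at most n
length(rows_all)       # always n
```

Indexing the data with rows_all would give a sample of exactly 50 rows. The frequency table above sums to 45, so five of the 50 draws landed on a person who had already been selected.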

Systematic Sampling

Using systematic sampling, with every 7th person selected after a random start, gives the following frequencies and mean.

k = floor(N / n)
paste("Sampling interval of:", k)
## [1] "Sampling interval of: 7"
set.seed(5919)
b = sample(k,1)
paste("First person selected in first group: ", b)
## [1] "First person selected in first group:  6"
print("All subsequent rows selected:")
## [1] "All subsequent rows selected:"
rows2 = seq(b, by = k, length = n); rows2
##  [1]   6  13  20  27  34  41  48  55  62  69  76  83  90  97 104 111 118 125 132
## [20] 139 146 153 160 167 174 181 188 195 202 209 216 223 230 237 244 251 258 265
## [39] 272 279 286 293 300 307 314 321 328 335 342 349
systematic = data[seq(b, by = k, length = n),]
table(systematic$Sleep.Duration)
## 
## 5.9   6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 7.1 7.2 7.3 7.4 7.6 7.7 7.8 8.2 8.4 8.5 
##   1   4   5   1   3   1   3   2   1   1   1   6   4   1   1   3   5   2   2   3
paste("The mean using systematic sampling is", round(mean(systematic$Sleep.Duration),3))
## [1] "The mean using systematic sampling is 7.068"
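Because 374 is not a multiple of 50, flooring the interval means the selection stops short of the end of the data. A short sketch of the arithmetic, using the same N, n and k as above:

```r
N <- 374
n <- 50
k <- floor(N / n)   # sampling interval

# Last row reachable for each possible random start b = 1..k
starts   <- seq_len(k)
last_row <- starts + k * (n - 1)

max(last_row)       # highest row any start can reach
N - max(last_row)   # rows at the end that can never be selected
```

With an interval of 7, no start can reach past row 350, so the final 24 rows have no chance of being selected. If the rows are ordered in some meaningful way, this could bias the sample.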

Stratified Sampling by Gender

Stratified sampling by gender gives the following sample and mean. Females and males are nearly equally represented, so proportional allocation gives each stratum 25 people.

mod_data = data.frame(Gender = data$Gender, Sleep = data$Sleep.Duration)
set.seed(5919)
print("Frequencies of gender in data:")
## [1] "Frequencies of gender in data:"
gender_freq = table(mod_data$Gender); gender_freq
## 
## Female   Male 
##    185    189
print("Strata Sizes:")
## [1] "Strata Sizes:"
strata_sizes = round(n * gender_freq / sum(gender_freq)); strata_sizes
## 
## Female   Male 
##     25     25
stratified = sampling::strata(mod_data,
                              stratanames = c("Gender"),
                              size = strata_sizes, 
                              method = "srswor", 
                              description = TRUE)
## Stratum 1 
## 
## Population total and number of selected units: 189 25 
## Stratum 2 
## 
## Population total and number of selected units: 185 25 
## Number of strata  2 
## Total number of selected units 50
strat_data = sampling::getdata(mod_data, stratified); strat_data
##     Sleep Gender ID_unit      Prob Stratum
## 29    7.9   Male      29 0.1322751       1
## 30    7.9   Male      30 0.1322751       1
## 36    6.1   Male      36 0.1322751       1
## 38    7.6   Male      38 0.1322751       1
## 40    7.6   Male      40 0.1322751       1
## 57    7.7   Male      57 0.1322751       1
## 65    6.2   Male      65 0.1322751       1
## 80    6.0   Male      80 0.1322751       1
## 90    7.3   Male      90 0.1322751       1
## 110   7.4   Male     110 0.1322751       1
## 147   7.2   Male     147 0.1322751       1
## 148   6.5   Male     148 0.1322751       1
## 166   7.6   Male     166 0.1322751       1
## 168   7.1   Male     168 0.1322751       1
## 169   7.1   Male     169 0.1322751       1
## 175   7.6   Male     175 0.1322751       1
## 177   7.6   Male     177 0.1322751       1
## 184   7.8   Male     184 0.1322751       1
## 199   6.5   Male     199 0.1322751       1
## 212   7.8   Male     212 0.1322751       1
## 217   7.8   Male     217 0.1322751       1
## 223   6.3   Male     223 0.1322751       1
## 224   6.4   Male     224 0.1322751       1
## 248   6.8   Male     248 0.1322751       1
## 278   8.1   Male     278 0.1322751       1
## 33    7.9 Female      33 0.1351351       2
## 70    6.2 Female      70 0.1351351       2
## 95    7.2 Female      95 0.1351351       2
## 101   7.2 Female     101 0.1351351       2
## 124   7.2 Female     124 0.1351351       2
## 141   7.1 Female     141 0.1351351       2
## 150   8.0 Female     150 0.1351351       2
## 189   6.7 Female     189 0.1351351       2
## 229   6.6 Female     229 0.1351351       2
## 246   6.5 Female     246 0.1351351       2
## 259   6.6 Female     259 0.1351351       2
## 269   6.0 Female     269 0.1351351       2
## 288   6.0 Female     288 0.1351351       2
## 294   6.0 Female     294 0.1351351       2
## 304   6.0 Female     304 0.1351351       2
## 318   8.5 Female     318 0.1351351       2
## 319   8.4 Female     319 0.1351351       2
## 321   8.5 Female     321 0.1351351       2
## 331   8.5 Female     331 0.1351351       2
## 335   8.4 Female     335 0.1351351       2
## 339   8.5 Female     339 0.1351351       2
## 346   8.2 Female     346 0.1351351       2
## 355   8.0 Female     355 0.1351351       2
## 370   8.1 Female     370 0.1351351       2
## 371   8.0 Female     371 0.1351351       2
paste("The mean using stratified sampling by gender is", round(mean(strat_data$Sleep),3))
## [1] "The mean using stratified sampling by gender is 7.284"
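The strata sizes come from proportional allocation: each stratum receives a share of the sample equal to its share of the population. A minimal sketch of that calculation, where proportional_allocation is a helper name introduced here for illustration:

```r
# Proportional allocation: each stratum gets n * (stratum size / population size)
proportional_allocation <- function(n, freq) {
  exact <- n * freq / sum(freq)
  list(exact = exact, rounded = round(exact))
}

gender_freq <- c(Female = 185, Male = 189)  # counts from the table above
alloc <- proportional_allocation(50, gender_freq)
alloc$exact
alloc$rounded
sum(alloc$rounded)  # rounding does not always return exactly n, so check it
```

The exact shares are about 24.7 and 25.3, which both round to 25. Note that the strata() output above lists the 189 males as Stratum 1 even though the size vector is in Female, Male order, which suggests sizes are matched by the order strata appear in the data rather than by name. That is harmless here because both sizes are 25.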

Stratified Sampling by Sleep Disorder

Given the relationship seen earlier between sleep disorders and sleep duration, I’ve also used stratified sampling by sleep disorder. There are more people without a sleep disorder, so the stratum for No is larger (29 versus 21).

mod_data2 = data.frame(Disorder = data$Disorder.Exists, Sleep = data$Sleep.Duration)
set.seed(5919)
print("Frequencies of sleep disorder existing in data:")
## [1] "Frequencies of sleep disorder existing in data:"
disorder_freq = table(mod_data2$Disorder); disorder_freq
## 
##  No Yes 
## 219 155
print("Strata Sizes:")
## [1] "Strata Sizes:"
strata_sizes2 = round(n * disorder_freq / sum(disorder_freq)); strata_sizes2
## 
##  No Yes 
##  29  21
stratified2 = sampling::strata(mod_data2,
                              stratanames = c("Disorder"),
                              size = strata_sizes2, 
                              method = "srswor", 
                              description = TRUE)
## Stratum 1 
## 
## Population total and number of selected units: 219 29 
## Stratum 2 
## 
## Population total and number of selected units: 155 21 
## Number of strata  2 
## Total number of selected units 50
strat_data2 = sampling::getdata(mod_data2, stratified2); strat_data2
##     Sleep Disorder ID_unit      Prob Stratum
## 36    6.1       No      36 0.1324201       1
## 37    6.1       No      37 0.1324201       1
## 40    7.6       No      40 0.1324201       1
## 42    7.7       No      42 0.1324201       1
## 44    7.8       No      44 0.1324201       1
## 62    6.0       No      62 0.1324201       1
## 71    6.1       No      71 0.1324201       1
## 86    7.2       No      86 0.1324201       1
## 93    7.5       No      93 0.1324201       1
## 107   6.1       No     107 0.1324201       1
## 121   7.2       No     121 0.1324201       1
## 122   7.2       No     122 0.1324201       1
## 123   7.2       No     123 0.1324201       1
## 135   7.3       No     135 0.1324201       1
## 137   7.1       No     137 0.1324201       1
## 138   7.1       No     138 0.1324201       1
## 144   7.1       No     144 0.1324201       1
## 150   8.0       No     150 0.1324201       1
## 157   7.2       No     157 0.1324201       1
## 168   7.1       No     168 0.1324201       1
## 182   7.8       No     182 0.1324201       1
## 206   7.7       No     206 0.1324201       1
## 211   7.7       No     211 0.1324201       1
## 212   7.8       No     212 0.1324201       1
## 317   8.5       No     317 0.1324201       1
## 321   8.5       No     321 0.1324201       1
## 331   8.5       No     331 0.1324201       1
## 337   8.4       No     337 0.1324201       1
## 360   8.1       No     360 0.1324201       1
## 17    6.5      Yes      17 0.1354839       2
## 19    6.5      Yes      19 0.1354839       2
## 68    6.0      Yes      68 0.1354839       2
## 94    7.4      Yes      94 0.1354839       2
## 105   7.2      Yes     105 0.1354839       2
## 190   6.5      Yes     190 0.1354839       2
## 201   6.5      Yes     201 0.1354839       2
## 221   6.6      Yes     221 0.1354839       2
## 228   6.3      Yes     228 0.1354839       2
## 233   6.6      Yes     233 0.1354839       2
## 240   6.4      Yes     240 0.1354839       2
## 251   6.8      Yes     251 0.1354839       2
## 259   6.6      Yes     259 0.1354839       2
## 282   6.1      Yes     282 0.1354839       2
## 287   6.0      Yes     287 0.1354839       2
## 298   6.1      Yes     298 0.1354839       2
## 346   8.2      Yes     346 0.1354839       2
## 347   8.2      Yes     347 0.1354839       2
## 349   8.2      Yes     349 0.1354839       2
## 361   8.2      Yes     361 0.1354839       2
## 373   8.1      Yes     373 0.1354839       2
paste("The mean using stratified sampling by sleep disorder is", round(mean(strat_data2$Sleep),3))
## [1] "The mean using stratified sampling by sleep disorder is 7.174"
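Stratified sampling can also be sketched in base R with the stratum sizes looked up by name, which avoids depending on the order in which strata appear. This is an alternative to sampling::strata(), not what the report ran. The stratum labels are rebuilt from the counts above, so the row order is illustrative and only row indices are sampled.

```r
set.seed(5919)

# Stratum labels rebuilt from the counts above; row order is illustrative only
disorder <- rep(c("No", "Yes"), times = c(219, 155))
sizes    <- c(No = 29, Yes = 21)

# Sample row indices within each stratum, looking sizes up by stratum name
rows_by_stratum <- split(seq_along(disorder), disorder)
picked <- unlist(lapply(names(rows_by_stratum), function(s) {
  candidates <- rows_by_stratum[[s]]
  candidates[sample.int(length(candidates), sizes[[s]])]
}))

table(disorder[picked])
```

On the real data, picked would index the rows of data directly and the mean of Sleep.Duration over those rows would be the stratified estimate.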

Comparing the mean sleep duration from these four sampling methods with the population mean, systematic sampling gave the lowest mean and stratified sampling by gender the highest. The sample mean closest to the population mean came from simple random sampling with replacement, followed by stratified sampling by sleep disorder.

mean_comp = c(mean(data$Sleep.Duration), 
              mean(simple_random$Sleep.Duration), 
              mean(systematic$Sleep.Duration), 
              mean(strat_data$Sleep),
              mean(strat_data2$Sleep))
names(mean_comp) = c("Population", "SimpleRandom", "Systematic", "Strata_Gender", "Strata_Disorder")
mean_comp
##      Population    SimpleRandom      Systematic   Strata_Gender Strata_Disorder 
##        7.132086        7.128889        7.068000        7.284000        7.174000
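The ranking is easier to read as gaps from the population mean. A minimal sketch using the means printed above:

```r
# Means copied from the mean_comp output above
mean_comp <- c(Population = 7.132086, SimpleRandom = 7.128889,
               Systematic = 7.068000, Strata_Gender = 7.284000,
               Strata_Disorder = 7.174000)

# Absolute gap between each sample mean and the population mean, smallest first
gap <- sort(abs(mean_comp[-1] - mean_comp[["Population"]]))
round(gap, 3)
```

Simple random sampling is about 0.003 hours from the population mean, stratified by sleep disorder about 0.042, systematic about 0.064, and stratified by gender about 0.152. These gaps come from a single draw per method with one seed, so repeated draws would be needed to compare the methods themselves.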

Conclusions

We have seen in the data that this group of people gets 5.8 to 8.5 hours of sleep, with a mean of 7.13 hours, and that women get slightly more sleep than men. There may be a correlation between higher physical activity levels and greater sleep duration. More data would be needed, as two groups of women did not fit this pattern.

We’ve seen that having a sleep disorder is associated with a higher BMI, and a higher BMI is associated with lower sleep duration. People who get the most sleep have no sleep disorder and a normal BMI.

Despite sleep duration not having a clear distribution in this dataset, the means of 1000 random samples at each of three sample sizes (25, 50 and 75) behaved as the Central Limit Theorem predicts.

Of the sampling methods compared, simple random sampling with replacement gave the mean closest to the population mean, followed by stratified sampling by sleep disorder. Each method was drawn only once, so this ranking could change with a different seed.